R Markdown


##set.seed(1)
gliders<-read.csv("~/Desktop/IMOS_-_Australian_National_Facility_for_Ocean_Gliders_(ANFOG)_-_delayed_mode_glider_deployments.csv", skip = 41, header = TRUE)

Four parts have been completed so far. The first part removes invalid data (lines 57-69). The second part is understanding the dataset: the percentage of good and bad data for each variable and the number of missing values for each variable (lines 144-252), and the potential relationship between the response and each variable (lines 254, 318). The third part calculates the correlation matrix and creates a correlation plot (lines 136-142). The fourth part finds the response distribution (lines 72-134).

str(gliders)
## 'data.frame':    3123117 obs. of  58 variables:
##  $ FID                      : chr  "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d76" "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d75" "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d74" "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d73" ...
##  $ file_id                  : int  185 185 185 185 185 185 185 185 185 185 ...
##  $ deployment_name          : chr  "TwoRocks20130215" "TwoRocks20130215" "TwoRocks20130215" "TwoRocks20130215" ...
##  $ platform_type            : chr  "slocum glider" "slocum glider" "slocum glider" "slocum glider" ...
##  $ platform_code            : chr  "SL248" "SL248" "SL248" "SL248" ...
##  $ time_coverage_start      : chr  "2013-02-15T03:13:29Z" "2013-02-15T03:13:29Z" "2013-02-15T03:13:29Z" "2013-02-15T03:13:29Z" ...
##  $ time_coverage_end        : chr  "2013-03-11T20:14:20Z" "2013-03-11T20:14:20Z" "2013-03-11T20:14:20Z" "2013-03-11T20:14:20Z" ...
##  $ TIME                     : chr  "2013-03-06T22:18:19Z" "2013-03-06T22:18:22Z" "2013-03-06T22:18:23Z" "2013-03-06T22:18:26Z" ...
##  $ TIME_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE                 : num  -31.8 -31.8 -31.8 -31.8 -31.8 ...
##  $ LATITUDE_quality_control : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ LONGITUDE                : num  115 115 115 115 115 ...
##  $ LONGITUDE_quality_control: int  8 8 8 8 8 8 8 8 8 8 ...
##  $ PRES                     : num  32.8 33.1 33.4 33.7 34 ...
##  $ PRES_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DEPTH                    : num  32.5 32.8 33.1 33.5 33.8 ...
##  $ DEPTH_quality_control    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PROFILE                  : int  4907 4907 4907 4907 4907 4907 4907 4907 4907 4907 ...
##  $ PROFILE_quality_control  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PHASE                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PHASE_quality_control    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TEMP                     : num  23.8 23.8 23.8 23.8 23.8 ...
##  $ TEMP_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PSAL                     : num  35.3 35.3 35.3 35.3 35.3 ...
##  $ PSAL_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DOX1                     : num  201 201 201 201 201 ...
##  $ DOX1_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DOX2                     : num  197 197 197 197 197 ...
##  $ DOX2_quality_control     : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ CPHL                     : num  0.255 0.258 0.263 0.262 0.258 ...
##  $ CPHL_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ CDOM                     : num  0.889 0.846 0.845 0.955 0.801 ...
##  $ CDOM_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ CNDC                     : num  5.23 5.23 5.23 5.23 5.23 ...
##  $ CNDC_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ VBSC                     : num  4e-04 4e-04 5e-04 8e-04 5e-04 4e-04 4e-04 4e-04 4e-04 7e-04 ...
##  $ VBSC_quality_control     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ NTRA                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NTRA_quality_control     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ UCUR                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UCUR_quality_control     : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ VCUR                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ VCUR_quality_control     : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ HEAD                     : num  300 300 302 302 304 ...
##  $ HEAD_quality_control     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UCUR_GPS                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UCUR_GPS_quality_control : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ VCUR_GPS                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ VCUR_GPS_quality_control : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ IRRAD443                 : num  0.272 0.264 0.258 0.258 0.258 ...
##  $ IRRAD443_quality_control : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ IRRAD490                 : num  0.354 0.346 0.34 0.338 0.338 ...
##  $ IRRAD490_quality_control : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ IRRAD555                 : num  0.0712 0.0675 0.0675 0.0666 0.0634 0.06 0.0609 0.0618 0.0609 0.0557 ...
##  $ IRRAD555_quality_control : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ IRRAD670                 : num  0.0114 0.0133 0.0128 0.0104 0.0104 0.0123 0.0114 0.0104 0.0109 0.0133 ...
##  $ IRRAD670_quality_control : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ geom                     : chr  "POINT (114.98013440372972 -31.802694506260686)" "POINT (114.98013265657173 -31.802693721740056)" "POINT (114.98013155928385 -31.802693229028474)" "POINT (114.98012962223838 -31.802692359243295)" ...
attach(gliders)
#install.packages("visdat")
library("visdat")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#install.packages("gamlss")
library(gamlss)
## Loading required package: splines
## Loading required package: gamlss.data
## 
## Attaching package: 'gamlss.data'
## The following object is masked from 'package:datasets':
## 
##     sleep
## Loading required package: gamlss.dist
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## Loading required package: parallel
##  **********   GAMLSS Version 5.2-0  **********
## For more on GAMLSS look at https://www.gamlss.com/
## Type gamlssNews() to see new features/changes/bug fixes.
library(gamlss.dist)
#install.packages("gamlss.add")
library(mgcv)
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
library(gamlss.add)
## Loading required package: nnet
## 
## Attaching package: 'nnet'
## The following object is masked from 'package:mgcv':
## 
##     multinom
## Loading required package: rpart
library("MASS")
#install.packages("goftest")
library("goftest")
library(fitdistrplus)
## Loading required package: survival
library("corrplot")
## corrplot 0.84 loaded

Missing data distribution

gliders%>%
  sample_n(100000) %>%
  vis_miss(warn_large_data = FALSE)
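
As a numeric complement to the plot, the share of missing values per column can be computed directly. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative and not taken from the glider data.

```r
# Toy data frame standing in for `gliders` (values are made up)
toy <- data.frame(
  TEMP = c(23.8, NA, 23.7, 23.9),
  NTRA = c(NA, NA, NA, NA),
  CPHL = c(0.25, 0.26, NA, 0.26)
)

# Fraction of missing values in each column, largest first
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(missing_share)
```

Columns that are entirely missing, such as the illustrative NTRA column here, show a share of 1 and are natural candidates to drop before modelling.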

We check each variable against its valid range and delete the invalid data.

# Keep only rows whose values lie inside the valid range of each variable.
# Note: logical indexing keeps a row of NAs wherever the condition is NA,
# so rows with missing measurements remain in gliders_valid as all-NA rows.
#count(PSAL >= 2 & PSAL <= 41)
gliders_valid<-gliders[(PSAL >= 2 & PSAL <= 41),] # PSAL resolves via attach(gliders)
#count(gliders_valid$CPHL>= 0&gliders_valid$CPHL<=100)
gliders_valid<-gliders_valid[(gliders_valid$CPHL>= 0&gliders_valid$CPHL<=100),]
#count(gliders_valid$CDOM >=0 & gliders_valid$CDOM <= 400)
gliders_valid<-gliders_valid[(gliders_valid$CDOM>=0& gliders_valid$CDOM <= 400),]
#count(gliders_valid$VBSC>=0& gliders_valid$VBSC <= 0.1)
gliders_valid<-gliders_valid[(gliders_valid$VBSC>=0 & gliders_valid$VBSC <= 0.1),]
#count(gliders_valid$IRRAD555>=0& gliders_valid$IRRAD555 <= 1000)
gliders_valid<-gliders_valid[(gliders_valid$IRRAD555>=0& gliders_valid$IRRAD555 <= 1000),]
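
Logical indexing returns a row of NAs wherever the condition itself evaluates to NA, so missing measurements survive the range filters above as all-NA rows. The following is a minimal sketch of an NA-safe alternative on made-up values; `toy` and `in_range` are illustrative names, not part of the original analysis.

```r
toy <- data.frame(PSAL = c(35.3, NA, 50.0, 35.1),
                  CPHL = c(0.25, 0.30, 0.40, NA))

# Plain logical indexing: the NA condition yields an all-NA row
kept_plain <- toy[toy$PSAL >= 2 & toy$PSAL <= 41, ]

# which() returns only the positions where the condition is TRUE,
# so rows with an NA or FALSE condition are dropped
in_range <- which(toy$PSAL >= 2 & toy$PSAL <= 41)
kept_safe <- toy[in_range, ]

nrow(kept_plain)  # counts the all-NA row as well
nrow(kept_safe)   # counts only rows that passed the range check
```

Whether dropping the missing rows at this stage is desirable depends on the later steps; the sketch only shows how the two indexing styles differ.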

We look at the distribution of the response. The goodness-of-fit tests below reject the normal, gamma, log-normal, logistic and beta distributions, since every p-value is very small. We therefore defer the choice of distribution to the modelling stage, where we will compare the MSE and other accuracy measures of the model under different distributions.

plotdist(gliders_valid$CPHL)

We first check whether it follows a gamma distribution. Judging by the density plot, the gamma fit seems to be reasonable.

#install.packages("fitdistrplus")
library("fitdistrplus")
library(ggplot2)
summary(gliders_valid$CPHL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.4     0.5     0.6     0.7    15.9 1514906
plot(density(gliders_valid$CPHL, na.rm = TRUE), xlim = c(0,2))
CPHL_positive <- gliders_valid[(gliders_valid$CPHL >0),]
CPHL_positive_value <- CPHL_positive$CPHL
fit_CPHL<-fitdistr(na.omit(CPHL_positive_value), "gamma") 
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

rand.gamma <- rgamma(100000,
shape = fit_CPHL$estimate[1],
rate = fit_CPHL$estimate[2])
lines(density(rand.gamma),col="red")

CPHL_value<-na.omit(CPHL_positive_value)
CPHL_vector<-c(CPHL_value)

However, the p-value for the gamma distribution is too small, so we reject the hypothesis that CPHL follows a gamma distribution.

fit_CPHL<-fitdist(CPHL_vector, "gamma") 
plot(fit_CPHL)

cvm.test(na.omit(CPHL_positive_value),"pgamma",shape = fit_CPHL$estimate[1],rate = fit_CPHL$estimate[2])
## 
##  Cramer-von Mises test of goodness-of-fit
##  Null hypothesis: Gamma distribution
##  with parameters shape = 3.14158982307635, rate = 5.5431097486641
##  Parameters assumed to be fixed
## 
## data:  na.omit(CPHL_positive_value)
## omega2 = 670.62, p-value < 2.2e-16

This Cullen and Frey graph indicates which distributions might suit the response variable. The observation and the bootstrapped values mostly fall near the bottom of the plot, so we cannot reach a final decision from the graph alone.

descdist(CPHL_vector,boot = 100, discrete = FALSE)

## summary statistics
## ------
## min:  1e-04   max:  15.9128 
## median:  0.5122 
## mean:  0.5667723 
## estimated sd:  0.310692 
## estimated skewness:  2.51739 
## estimated kurtosis:  48.08185

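The p-value is too small, so we reject the beta distribution. Note that the beta parameters are fitted to CPHL/100, which maps the values into (0, 1), while the test below is applied to the unscaled values, so the test statistic should be read with caution.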

fit_CPHL_beta<-fitdist(CPHL_vector/100, "beta") 
#plot(fit_CPHL_beta)
cvm.test(na.omit(CPHL_positive_value),"pbeta",shape1 = fit_CPHL_beta$estimate[1],shape2 = fit_CPHL_beta$estimate[2])
## 
##  Cramer-von Mises test of goodness-of-fit
##  Null hypothesis: beta distribution
##  with parameters shape1 = 3.12650554093763, shape2 = 548.647476092144
##  Parameters assumed to be fixed
## 
## data:  na.omit(CPHL_positive_value)
## omega2 = 527344, p-value < 2.2e-16

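The p-value is too small, so CPHL is not normally distributed. Note that ks.test with "pnorm" and no further arguments compares the data against the standard normal distribution (mean 0, sd 1).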

ks.test(CPHL_vector,"pnorm")
## Warning in ks.test(CPHL_vector, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  CPHL_vector
## D = 0.51268, p-value < 2.2e-16
## alternative hypothesis: two-sided

The p-values are also too small for the log-normal and logistic distributions, again tested with their default parameters.

ks.test(CPHL_vector,"plnorm")
## Warning in ks.test(CPHL_vector, "plnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  CPHL_vector
## D = 0.4169, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(CPHL_vector,"plogis")
## Warning in ks.test(CPHL_vector, "plogis"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  CPHL_vector
## D = 0.50305, p-value < 2.2e-16
## alternative hypothesis: two-sided
#fit <- fitDist(CPHL_vector, k = 2, type = "realplus", trace = FALSE, try.gamlss = TRUE)
#summary(fit)
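
ks.test compares the sample against the default parameters of the named distribution unless parameters are passed explicitly. The following is a minimal sketch, on simulated data rather than the glider measurements, of passing parameters estimated from the sample. The names `x` and `fit_ln` are illustrative, and because the parameters are estimated from the same data the resulting p-value is only approximate.

```r
library(MASS)

set.seed(1)
x <- rlnorm(500, meanlog = -0.6, sdlog = 0.5)  # simulated positive response

# Maximum-likelihood estimates of the log-normal parameters
fit_ln <- fitdistr(x, "lognormal")

# Default parameters (meanlog = 0, sdlog = 1) versus the fitted ones
ks_default <- ks.test(x, "plnorm")
ks_fitted  <- ks.test(x, "plnorm",
                      meanlog = fit_ln$estimate[["meanlog"]],
                      sdlog   = fit_ln$estimate[["sdlog"]])

ks_default$statistic
ks_fitted$statistic
```

On this simulated sample the statistic against the fitted distribution is much smaller than against the default one, which illustrates why a rejection under default parameters says little about the distribution family itself.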

We then look at the correlations: we calculate the correlation matrix of the numeric variables and draw a correlation plot.

cor_gliders<-gliders_valid[,c(10,12,14,16,18,20,22,24,26,28,30,32,34,36,44,50,52,54,56)]
cor_gliders<-na.omit(cor_gliders)
cor(cor_gliders)
##              LATITUDE    LONGITUDE         PRES        DEPTH      PROFILE
## LATITUDE   1.00000000 -0.912478864 -0.121970430 -0.121534554  0.557930174
## LONGITUDE -0.91247886  1.000000000  0.108698255  0.108329000 -0.652572700
## PRES      -0.12197043  0.108698255  1.000000000  0.999999844 -0.008701475
## DEPTH     -0.12153455  0.108329000  0.999999844  1.000000000 -0.008467876
## PROFILE    0.55793017 -0.652572700 -0.008701475 -0.008467876  1.000000000
## PHASE      0.05035957 -0.054431385  0.028164884  0.028188523  0.034240325
## TEMP       0.84783867 -0.908898705 -0.154921418 -0.154568119  0.794417359
## PSAL      -0.06272787  0.415782393  0.131589931  0.131624916 -0.257238698
## DOX1      -0.86853465  0.819826205 -0.080381377 -0.080765062 -0.401137404
## DOX2      -0.87696424  0.823745572 -0.055059349 -0.055445942 -0.403082804
## CPHL      -0.10062015  0.013705696 -0.095432061 -0.095433453 -0.084960865
## CDOM      -0.06980283  0.110778218  0.053106574  0.053105762 -0.052997403
## CNDC       0.85484999 -0.885696031 -0.139113213 -0.138751091  0.790885100
## VBSC       0.15764057 -0.220949160 -0.022081527 -0.022002782  0.182499741
## HEAD      -0.11255137  0.173964611 -0.060476018 -0.060552838 -0.268255714
## IRRAD443   0.03539622  0.011403635 -0.305939677 -0.305955245 -0.010751948
## IRRAD490   0.03359733  0.011762414 -0.308475197 -0.308488954 -0.013329206
## IRRAD555   0.02067306  0.001297842 -0.290644264 -0.290667429 -0.014633509
## IRRAD670   0.01084460 -0.009899544 -0.142553594 -0.142570719 -0.018869961
##                  PHASE        TEMP        PSAL        DOX1        DOX2
## LATITUDE   0.050359573  0.84783867 -0.06272787 -0.86853465 -0.87696424
## LONGITUDE -0.054431385 -0.90889871  0.41578239  0.81982621  0.82374557
## PRES       0.028164884 -0.15492142  0.13158993 -0.08038138 -0.05505935
## DEPTH      0.028188523 -0.15456812  0.13162492 -0.08076506 -0.05544594
## PROFILE    0.034240325  0.79441736 -0.25723870 -0.40113740 -0.40308280
## PHASE      1.000000000  0.04378828 -0.02585608 -0.06598361 -0.06288317
## TEMP       0.043788281  1.00000000 -0.27582846 -0.64556840 -0.65821875
## PSAL      -0.025856076 -0.27582846  1.00000000  0.13688353  0.12220988
## DOX1      -0.065983608 -0.64556840  0.13688353  1.00000000  0.99420059
## DOX2      -0.062883168 -0.65821875  0.12220988  0.99420059  1.00000000
## CPHL       0.017197357 -0.10380024 -0.18050703  0.03483310  0.04364376
## CDOM       0.006706789 -0.08867170  0.10085210  0.07748184  0.08168960
## CNDC       0.041693575  0.99581222 -0.18789887 -0.63996695 -0.65401560
## VBSC       0.046207014  0.21489991 -0.16728267 -0.10022371 -0.08418186
## HEAD      -0.039380213 -0.21354917  0.08663419  0.09434946  0.09384144
## IRRAD443   0.015998630  0.03729050  0.07913636  0.13722725  0.12644289
## IRRAD490   0.021095901  0.03377881  0.07307605  0.14105569  0.13064252
## IRRAD555   0.005536004  0.03104157  0.02163915  0.12764225  0.11854730
## IRRAD670  -0.020238699  0.01796749 -0.01379555  0.04145448  0.03654562
##                  CPHL         CDOM        CNDC         VBSC         HEAD
## LATITUDE  -0.10062015 -0.069802827  0.85484999  0.157640574 -0.112551366
## LONGITUDE  0.01370570  0.110778218 -0.88569603 -0.220949160  0.173964611
## PRES      -0.09543206  0.053106574 -0.13911321 -0.022081527 -0.060476018
## DEPTH     -0.09543345  0.053105762 -0.13875109 -0.022002782 -0.060552838
## PROFILE   -0.08496087 -0.052997403  0.79088510  0.182499741 -0.268255714
## PHASE      0.01719736  0.006706789  0.04169358  0.046207014 -0.039380213
## TEMP      -0.10380024 -0.088671695  0.99581222  0.214899914 -0.213549169
## PSAL      -0.18050703  0.100852097 -0.18789887 -0.167282668  0.086634188
## DOX1       0.03483310  0.077481840 -0.63996695 -0.100223714  0.094349460
## DOX2       0.04364376  0.081689603 -0.65401560 -0.084181860  0.093841439
## CPHL       1.00000000  0.148703918 -0.12478743  0.294088672 -0.036607286
## CDOM       0.14870392  1.000000000 -0.08014901  0.222915604  0.008238268
## CNDC      -0.12478743 -0.080149008  1.00000000  0.205110388 -0.211483736
## VBSC       0.29408867  0.222915604  0.20511039  1.000000000 -0.057866663
## HEAD      -0.03660729  0.008238268 -0.21148374 -0.057866663  1.000000000
## IRRAD443  -0.29121472 -0.011593113  0.04382933 -0.036704461 -0.001917468
## IRRAD490  -0.27350904 -0.010331754  0.03962314 -0.034630445 -0.003736401
## IRRAD555  -0.25021044 -0.010795613  0.03220191 -0.024689586 -0.010992031
## IRRAD670  -0.12444173 -0.005522851  0.01624218 -0.002849081 -0.008791674
##               IRRAD443     IRRAD490     IRRAD555     IRRAD670
## LATITUDE   0.035396219  0.033597327  0.020673062  0.010844602
## LONGITUDE  0.011403635  0.011762414  0.001297842 -0.009899544
## PRES      -0.305939677 -0.308475197 -0.290644264 -0.142553594
## DEPTH     -0.305955245 -0.308488954 -0.290667429 -0.142570719
## PROFILE   -0.010751948 -0.013329206 -0.014633509 -0.018869961
## PHASE      0.015998630  0.021095901  0.005536004 -0.020238699
## TEMP       0.037290499  0.033778810  0.031041570  0.017967493
## PSAL       0.079136356  0.073076054  0.021639150 -0.013795553
## DOX1       0.137227253  0.141055692  0.127642250  0.041454478
## DOX2       0.126442885  0.130642520  0.118547299  0.036545624
## CPHL      -0.291214717 -0.273509041 -0.250210436 -0.124441731
## CDOM      -0.011593113 -0.010331754 -0.010795613 -0.005522851
## CNDC       0.043829330  0.039623137  0.032201906  0.016242177
## VBSC      -0.036704461 -0.034630445 -0.024689586 -0.002849081
## HEAD      -0.001917468 -0.003736401 -0.010992031 -0.008791674
## IRRAD443   1.000000000  0.995043435  0.958353383  0.534655413
## IRRAD490   0.995043435  1.000000000  0.949908054  0.512148984
## IRRAD555   0.958353383  0.949908054  1.000000000  0.664028520
## IRRAD670   0.534655413  0.512148984  0.664028520  1.000000000
corrplot(corr=cor(cor_gliders),method = "color",tl.col="black")
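
Near-duplicate predictors are easier to spot when the matrix is reduced to the pairs whose absolute correlation exceeds a threshold. The following is a minimal sketch on simulated columns; the 0.9 cutoff and all object names are illustrative assumptions.

```r
set.seed(1)
n <- 200
pres <- runif(n, 0, 200)
toy <- data.frame(
  PRES  = pres,
  DEPTH = pres * 0.99 + rnorm(n, sd = 0.1),  # nearly a copy of PRES
  TEMP  = rnorm(n, 18, 3)                    # unrelated to the other two
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and filter by |r|
idx <- which(upper.tri(cm) & abs(cm) > 0.9, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Applied to the matrix above, the same filter would flag pairs such as PRES/DEPTH and TEMP/CNDC, whose correlations exceed 0.99.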

Calculate the percentage of good data and bad data for each variable.

hist(gliders_valid$TIME_quality_control) # all good data

hist(gliders_valid$LATITUDE_quality_control) # 1.5% good data, rest are Interpolated value

hist(gliders_valid$LONGITUDE_quality_control) # 1.5% good data, rest are Interpolated value

hist(gliders_valid$PRES_quality_control) # all good data

hist(gliders_valid$DEPTH_quality_control)# all good data

hist(gliders_valid$PROFILE_quality_control) # all No QC performed data

hist(gliders_valid$PHASE_quality_control)# all No QC performed data

hist(gliders_valid$TEMP_quality_control)# all good data

hist(gliders_valid$PSAL_quality_control)# all good data

hist(gliders_valid$DOX1_quality_control)# all good data

hist(gliders_valid$DOX2_quality_control)# 80.5% good data, 11.3% bad data that are potentially correctable, 8% missing values

hist(gliders_valid$CPHL_quality_control)# all good data

hist(gliders_valid$CDOM_quality_control)# all good data

hist(gliders_valid$CNDC_quality_control)# all good data

hist(gliders_valid$VBSC_quality_control)# 86% good data, 9% Bad data that are potentially correctable, 5% bad data

hist(gliders_valid$UCUR_quality_control)#  missing value

hist(gliders_valid$VCUR_quality_control)#  missing value

hist(gliders_valid$HEAD_quality_control)# 99.9% No QC performed data

hist(gliders_valid$UCUR_GPS_quality_control)# missing value

hist(gliders_valid$VCUR_GPS_quality_control)# missing value

hist(gliders_valid$IRRAD443_quality_control) # 54% good data, 46% bad data

hist(gliders_valid$IRRAD490_quality_control) # 54% good data, 46% bad data

hist(gliders_valid$IRRAD555_quality_control) # 54% good data, 46% bad data

hist(gliders_valid$IRRAD670_quality_control) # 54% good data, 46% bad data

Exact calculation of the percentage of missing data, good data and bad data for each variable.

table(gliders_valid$LATITUDE_quality_control)
## 
##       1       8       9 
##   23701 1562349     366
23701/(23701+1562349+366)
## [1] 0.01493997
366/(23701+1562349+366)
## [1] 0.0002307087
table(gliders_valid$LONGITUDE_quality_control)
## 
##       1       8       9 
##   23701 1562349     366
table(gliders_valid$PRES_quality_control)
## 
##       1 
## 1586416
table(gliders_valid$PROFILE_quality_control)
## 
##       0       9 
## 1586399      17
table(gliders_valid$PHASE_quality_control)
## 
##       0       9 
## 1586399      17
table(gliders_valid$DOX2_quality_control)# almost good data
## 
##       0       1       3       4       9 
##    1930 1277617  179281     703  126885
126885/(1930+1277617+179281+703+126885)
## [1] 0.07998217
table(gliders_valid$VBSC_quality_control)# almost good data
## 
##       0       1       3       4 
##    2230 1368673  139038   76475
76475/(2230+1368673+139038+76475)
## [1] 0.04820615
table(gliders_valid$UCUR_quality_control)
## 
##       0       9 
##     777 1585639
table(gliders_valid$VCUR_quality_control)
## 
##       0       9 
##     777 1585639
table(gliders_valid$HEAD_quality_control)
## 
##       0       9 
## 1584749    1667
1584749/(1584749+1667)
## [1] 0.9989492
table(gliders_valid$UCUR_GPS_quality_control)
## 
##       0       9 
##     777 1585639
table(gliders_valid$VCUR_GPS_quality_control)
## 
##       0       9 
##     777 1585639
table(gliders_valid$IRRAD443_quality_control)
## 
##      1      4 
## 863269 723147
table(gliders_valid$IRRAD490_quality_control)
## 
##      1      4 
## 863257 723159
table(gliders_valid$IRRAD555_quality_control)
## 
##      1      4 
## 863317 723099
table(gliders_valid$IRRAD670_quality_control)
## 
##      1      4 
## 862798 723618
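
The hand-computed ratios above can also be obtained for every flag at once with prop.table. The following is a minimal sketch using the LATITUDE_quality_control counts reported above; the vector `flags` is rebuilt from those counts purely for illustration.

```r
# Rebuild a flag vector from the counts printed for LATITUDE_quality_control
flags <- rep(c(1, 8, 9), times = c(23701, 1562349, 366))

# Share of each quality-control flag, in percent
flag_share <- round(100 * prop.table(table(flags)), 2)
print(flag_share)
```

In the analysis itself the same call could take a column such as gliders_valid$LATITUDE_quality_control directly in place of the rebuilt vector.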

The dataset contains one platform type and four different deployment time windows.

table(gliders_valid$platform_type)
## 
## slocum glider 
##       1586416
table(gliders_valid$time_coverage_start)
## 
## 2013-02-15T03:13:29Z 2013-10-31T01:16:21Z 2014-08-08T02:48:06Z 
##               202337               539721               685764 
## 2014-10-17T00:40:46Z 
##               158594
table(gliders_valid$time_coverage_end)
## 
## 2013-03-11T20:14:20Z 2013-11-13T05:44:04Z 2014-08-24T22:39:08Z 
##               202337               539721               685764 
## 2014-11-06T22:18:12Z 
##               158594

NTRA_quality_control contains no values.

There are four deployments in the dataset, and the number of observations varies by deployment: StormBay20141017 has the fewest and TwoRocks20140808 has the most.

table(gliders_valid$NTRA_quality_control)
## < table of extent 0 >
table(gliders_valid$deployment_name)
## 
## SpencerGulf20131031    StormBay20141017    TwoRocks20130215    TwoRocks20140808 
##              539721              158594              202337              685764

We look for potential relationships between the response (CPHL) and each variable by fitting a cubic smoothing spline to the data (a cubic smoothing spline plays a role similar to a line of best fit). TEMP: CPHL peaks at around 20 degrees and decreases above that; at around 21 degrees it tends towards 0. PRES: the points are spread evenly and no strong relationship is visible.

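temp_ok<-complete.cases(gliders_valid$TEMP,gliders_valid$CPHL) # drop incomplete (TEMP, CPHL) pairs together
mySpline<-smooth.spline(gliders_valid$TEMP[temp_ok],gliders_valid$CPHL[temp_ok])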
plot(gliders_valid$TEMP,gliders_valid$CPHL)
lines(mySpline$x, mySpline$y, col="red", lwd = 2)

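pres_ok<-complete.cases(gliders_valid$PRES,gliders_valid$CPHL) # drop incomplete (PRES, CPHL) pairs together
myPres<-smooth.spline(gliders_valid$PRES[pres_ok],gliders_valid$CPHL[pres_ok])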
plot(gliders_valid$PRES,gliders_valid$CPHL)
lines(myPres$x, myPres$y, col="red", lwd = 2)

DEPTH: the points are spread evenly and no clear relationship is visible.

depth<-data.frame(gliders_valid$DEPTH,gliders_valid$CPHL)
depth<-na.omit(depth)
myDepth<-smooth.spline(depth$gliders_valid.DEPTH,depth$gliders_valid.CPHL)
plot(gliders_valid$DEPTH,gliders_valid$CPHL)
lines(myDepth$x, myDepth$y, col="red", lwd = 2)

PSAL: most of the data lie between 34 and 36, with four observations below 34. Two peaks occur at around 35-36.

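psal_ok<-complete.cases(gliders_valid$PSAL,gliders_valid$CPHL) # drop incomplete (PSAL, CPHL) pairs together
myPsal<-smooth.spline(gliders_valid$PSAL[psal_ok],gliders_valid$CPHL[psal_ok])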
plot(gliders_valid$PSAL,gliders_valid$CPHL)
lines(myPsal$x, myPsal$y, col="red", lwd = 2)

plot(gliders_valid$LATITUDE,gliders_valid$CPHL)

plot(gliders_valid$LONGITUDE,gliders_valid$CPHL)

plot(gliders_valid$PROFILE,gliders_valid$CPHL)

PHASE takes only the values 0, 1, 3 and 4.

phase<-data.frame(gliders_valid$PHASE,gliders_valid$CPHL)
phase<-na.omit(phase)
myPhase<-smooth.spline(phase$gliders_valid.PHASE,phase$gliders_valid.CPHL)
plot(gliders_valid$PHASE,gliders_valid$CPHL)
lines(myPhase$x, myPhase$y, col="red", lwd = 2)

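dox1_ok<-complete.cases(gliders_valid$DOX1,gliders_valid$CPHL) # drop incomplete (DOX1, CPHL) pairs together
myDox1<-smooth.spline(gliders_valid$DOX1[dox1_ok],gliders_valid$CPHL[dox1_ok])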
plot(gliders_valid$DOX1,gliders_valid$CPHL)
lines(myDox1$x, myDox1$y, col="red", lwd = 2)

DOX2: the peak appears at around 190; CPHL then decreases and reaches about 0 once DOX2 exceeds 220. CDOM: most of the data lie between 0 and 50, and the smoothing spline suggests a potential positive relationship between CDOM and CPHL.

dox2<-data.frame(gliders_valid$DOX2,gliders_valid$CPHL)
dox2<-na.omit(dox2)
myDox2<-smooth.spline(dox2$gliders_valid.DOX2,dox2$gliders_valid.CPHL)
plot(gliders_valid$DOX2,gliders_valid$CPHL)
lines(myDox2$x, myDox2$y, col="red", lwd = 2)

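cdom_ok<-complete.cases(gliders_valid$CDOM,gliders_valid$CPHL) # drop incomplete (CDOM, CPHL) pairs together
myCDOM<-smooth.spline(gliders_valid$CDOM[cdom_ok],gliders_valid$CPHL[cdom_ok])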
plot(gliders_valid$CDOM,gliders_valid$CPHL)
lines(myCDOM$x, myCDOM$y, col="red", lwd = 2)

CNDC: CPHL peaks at around 4.8 and tends towards 0 at 4.2 and 5.0. VBSC: CPHL increases while VBSC is between 0.000 and 0.006, then drops suddenly at 0.006 and tends towards 0. HEAD: the points are spread evenly, but the highest values occur at around HEAD = 250. IRRAD443, IRRAD555 and IRRAD490 produce similar plots and appear to have an inverse relationship with CPHL.

cndc<-data.frame(gliders_valid$CNDC,gliders_valid$CPHL)
cndc<-na.omit(cndc)
mycndc<-smooth.spline(cndc$gliders_valid.CNDC,cndc$gliders_valid.CPHL)
plot(gliders_valid$CNDC,gliders_valid$CPHL)
lines(mycndc$x, mycndc$y, col="red", lwd = 2)

vbsc<-data.frame(gliders_valid$VBSC,gliders_valid$CPHL)
vbsc<-na.omit(vbsc)
myvbsc<-smooth.spline(vbsc$gliders_valid.VBSC,vbsc$gliders_valid.CPHL)
plot(gliders_valid$VBSC,gliders_valid$CPHL)
lines(myvbsc$x, myvbsc$y, col="red", lwd = 2)

head<-data.frame(gliders_valid$HEAD,gliders_valid$CPHL)
head<-na.omit(head)
myhead<-smooth.spline(head$gliders_valid.HEAD,head$gliders_valid.CPHL)
plot(gliders_valid$HEAD,gliders_valid$CPHL)
lines(myhead$x, myhead$y, col="red", lwd = 2)

irrd<-data.frame(gliders_valid$IRRAD443,gliders_valid$CPHL)
irrd<-na.omit(irrd)
myirrd<-smooth.spline(irrd$gliders_valid.IRRAD443,irrd$gliders_valid.CPHL)
plot(gliders_valid$IRRAD443,gliders_valid$CPHL)
lines(myirrd$x, myirrd$y, col="red", lwd = 2)

irrad<-data.frame(gliders_valid$IRRAD555,gliders_valid$CPHL)
irrad<-na.omit(irrad)
myirrad<-smooth.spline(irrad$gliders_valid.IRRAD555,irrad$gliders_valid.CPHL)
plot(gliders_valid$IRRAD555,gliders_valid$CPHL)
lines(myirrad$x, myirrad$y, col="red", lwd = 2)

irad<-data.frame(gliders_valid$IRRAD490,gliders_valid$CPHL)
irad<-na.omit(irad)
myirad<-smooth.spline(irad$gliders_valid.IRRAD490,irad$gliders_valid.CPHL)
plot(gliders_valid$IRRAD490,gliders_valid$CPHL)
lines(myirad$x, myirad$y, col="red", lwd = 2)
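
The repeated fit-and-plot pattern above can be wrapped in a small helper that removes incomplete (x, y) pairs together before fitting. The following is a minimal sketch on simulated data; `fit_spline` is a hypothetical helper and not part of the original analysis.

```r
# Fit a cubic smoothing spline to the complete (x, y) pairs only
fit_spline <- function(x, y) {
  ok <- complete.cases(x, y)
  smooth.spline(x[ok], y[ok])
}

set.seed(1)
x_obs <- runif(300, 12, 24)                                # temperature-like predictor
y_obs <- 0.6 - 0.02 * (x_obs - 18)^2 + rnorm(300, sd = 0.1)

# Append one incomplete pair to show that it is ignored by the helper
x <- c(x_obs, NA)
y <- c(y_obs, 0.5)

fit <- fit_spline(x, y)
plot(x, y)
lines(fit$x, fit$y, col = "red", lwd = 2)
```

Removing the pairs jointly keeps x and y aligned even when the two columns have missing values in different rows, which calling na.omit on each vector separately does not guarantee.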

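Understand the dataset. The 1,514,906 NA's reported for every numeric column, including file_id, are the all-NA rows that the logical subsetting in the cleaning step leaves in gliders_valid.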

summary(gliders_valid)
##      FID               file_id        deployment_name    platform_type     
##  Length:3101322     Min.   :185.0     Length:3101322     Length:3101322    
##  Class :character   1st Qu.:188.0     Class :character   Class :character  
##  Mode  :character   Median :189.0     Mode  :character   Mode  :character  
##                     Mean   :188.2                                          
##                     3rd Qu.:189.0                                          
##                     Max.   :190.0                                          
##                     NA's   :1514906                                        
##  platform_code      time_coverage_start time_coverage_end      TIME          
##  Length:3101322     Length:3101322      Length:3101322     Length:3101322    
##  Class :character   Class :character    Class :character   Class :character  
##  Mode  :character   Mode  :character    Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  TIME_quality_control    LATITUDE       LATITUDE_quality_control
##  Min.   :1            Min.   :-43.7     Min.   :1.0             
##  1st Qu.:1            1st Qu.:-35.4     1st Qu.:8.0             
##  Median :1            Median :-32.1     Median :8.0             
##  Mean   :1            Mean   :-34.1     Mean   :7.9             
##  3rd Qu.:1            3rd Qu.:-31.6     3rd Qu.:8.0             
##  Max.   :4            Max.   :-31.5     Max.   :9.0             
##  NA's   :1514906      NA's   :1515272   NA's   :1514906         
##    LONGITUDE       LONGITUDE_quality_control      PRES        
##  Min.   :115.0     Min.   :1.0               Min.   :  0.0    
##  1st Qu.:115.3     1st Qu.:8.0               1st Qu.: 15.8    
##  Median :115.5     Median :8.0               Median : 33.9    
##  Mean   :125.6     Mean   :7.9               Mean   : 44.0    
##  3rd Qu.:136.0     3rd Qu.:8.0               3rd Qu.: 65.7    
##  Max.   :147.8     Max.   :9.0               Max.   :198.9    
##  NA's   :1515272   NA's   :1514906           NA's   :1514906  
##  PRES_quality_control     DEPTH         DEPTH_quality_control    PROFILE       
##  Min.   :1            Min.   :  0.0     Min.   :1             Min.   :    0    
##  1st Qu.:1            1st Qu.: 15.7     1st Qu.:1             1st Qu.:  662    
##  Median :1            Median : 33.7     Median :1             Median : 1557    
##  Mean   :1            Mean   : 43.6     Mean   :1             Mean   : 2315    
##  3rd Qu.:1            3rd Qu.: 65.3     3rd Qu.:1             3rd Qu.: 3928    
##  Max.   :1            Max.   :197.5     Max.   :9             Max.   :15257    
##  NA's   :1514906      NA's   :1515244   NA's   :1514906       NA's   :1514923  
##  PROFILE_quality_control     PHASE         PHASE_quality_control
##  Min.   :0               Min.   :0.0       Min.   :0            
##  1st Qu.:0               1st Qu.:1.0       1st Qu.:0            
##  Median :0               Median :1.0       Median :0            
##  Mean   :0               Mean   :2.4       Mean   :0            
##  3rd Qu.:0               3rd Qu.:4.0       3rd Qu.:0            
##  Max.   :9               Max.   :4.0       Max.   :9            
##  NA's   :1514906         NA's   :1514923   NA's   :1514906      
##       TEMP         TEMP_quality_control      PSAL         PSAL_quality_control
##  Min.   :12.6      Min.   :0            Min.   :32.4      Min.   :0           
##  1st Qu.:16.1      1st Qu.:1            1st Qu.:35.2      1st Qu.:1           
##  Median :18.9      Median :1            Median :35.3      Median :1           
##  Mean   :18.2      Mean   :1            Mean   :35.4      Mean   :1           
##  3rd Qu.:20.3      3rd Qu.:1            3rd Qu.:35.7      3rd Qu.:1           
##  Max.   :24.1      Max.   :1            Max.   :36.2      Max.   :4           
##  NA's   :1514906   NA's   :1514906      NA's   :1514906   NA's   :1514906     
##       DOX1         DOX1_quality_control      DOX2         DOX2_quality_control
##  Min.   :178.1     Min.   :0            Min.   :176.6     Min.   :0.0         
##  1st Qu.:192.4     1st Qu.:1            1st Qu.:188.2     1st Qu.:1.0         
##  Median :201.0     Median :1            Median :196.7     Median :1.0         
##  Mean   :206.4     Mean   :1            Mean   :202.0     Mean   :1.9         
##  3rd Qu.:216.3     3rd Qu.:1            3rd Qu.:211.0     3rd Qu.:1.0         
##  Max.   :264.2     Max.   :1            Max.   :258.4     Max.   :9.0         
##  NA's   :1514906   NA's   :1514906      NA's   :1641791   NA's   :1514906     
##       CPHL         CPHL_quality_control      CDOM         CDOM_quality_control
##  Min.   : 0.0      Min.   :0.0          Min.   :  0.0     Min.   :0.0         
##  1st Qu.: 0.4      1st Qu.:1.0          1st Qu.:  0.4     1st Qu.:1.0         
##  Median : 0.5      Median :1.0          Median :  0.7     Median :1.0         
##  Mean   : 0.6      Mean   :1.1          Mean   :  0.8     Mean   :1.1         
##  3rd Qu.: 0.7      3rd Qu.:1.0          3rd Qu.:  1.1     3rd Qu.:1.0         
##  Max.   :15.9      Max.   :4.0          Max.   :242.8     Max.   :4.0         
##  NA's   :1514906   NA's   :1514906      NA's   :1514906   NA's   :1514906     
##       CNDC         CNDC_quality_control      VBSC         VBSC_quality_control
##  Min.   :4.0       Min.   :0            Min.   :0         Min.   :0.0         
##  1st Qu.:4.5       1st Qu.:1            1st Qu.:0         1st Qu.:1.0         
##  Median :4.7       Median :1            Median :0         Median :1.0         
##  Mean   :4.7       Mean   :1            Mean   :0         Mean   :1.3         
##  3rd Qu.:4.9       3rd Qu.:1            3rd Qu.:0         3rd Qu.:1.0         
##  Max.   :5.3       Max.   :4            Max.   :0         Max.   :4.0         
##  NA's   :1514906   NA's   :1514906      NA's   :1514906   NA's   :1514906     
##       NTRA         NTRA_quality_control      UCUR         UCUR_quality_control
##  Min.   : NA       Min.   : NA          Min.   :-0.3      Min.   :0           
##  1st Qu.: NA       1st Qu.: NA          1st Qu.:-0.1      1st Qu.:9           
##  Median : NA       Median : NA          Median : 0.0      Median :9           
##  Mean   :NaN       Mean   :NaN          Mean   : 0.0      Mean   :9           
##  3rd Qu.: NA       3rd Qu.: NA          3rd Qu.: 0.1      3rd Qu.:9           
##  Max.   : NA       Max.   : NA          Max.   : 0.3      Max.   :9           
##  NA's   :3101322   NA's   :3101322      NA's   :3100545   NA's   :1514906     
##       VCUR         VCUR_quality_control      HEAD         HEAD_quality_control
##  Min.   :-0.5      Min.   :0            Min.   :  0.0     Min.   :0           
##  1st Qu.:-0.1      1st Qu.:9            1st Qu.:104.3     1st Qu.:0           
##  Median : 0.0      Median :9            Median :166.5     Median :0           
##  Mean   : 0.0      Mean   :9            Mean   :174.3     Mean   :0           
##  3rd Qu.: 0.1      3rd Qu.:9            3rd Qu.:262.9     3rd Qu.:0           
##  Max.   : 0.4      Max.   :9            Max.   :359.9     Max.   :9           
##  NA's   :3100545   NA's   :1514906      NA's   :1516573   NA's   :1514906     
##     UCUR_GPS       UCUR_GPS_quality_control    VCUR_GPS      
##  Min.   :-0.6      Min.   :0                Min.   :-0.9     
##  1st Qu.:-0.2      1st Qu.:9                1st Qu.:-0.1     
##  Median :-0.1      Median :9                Median : 0.1     
##  Mean   :-0.1      Mean   :9                Mean   : 0.0     
##  3rd Qu.: 0.1      3rd Qu.:9                3rd Qu.: 0.2     
##  Max.   : 0.4      Max.   :9                Max.   : 0.6     
##  NA's   :3100545   NA's   :1514906          NA's   :3100545  
##  VCUR_GPS_quality_control    IRRAD443       IRRAD443_quality_control
##  Min.   :0                Min.   :  0.0     Min.   :1.0             
##  1st Qu.:9                1st Qu.:  0.0     1st Qu.:1.0             
##  Median :9                Median :  0.0     Median :1.0             
##  Mean   :9                Mean   :  6.3     Mean   :2.4             
##  3rd Qu.:9                3rd Qu.:  3.1     3rd Qu.:4.0             
##  Max.   :9                Max.   :321.8     Max.   :4.0             
##  NA's   :1514906          NA's   :1514906   NA's   :1514906         
##     IRRAD490       IRRAD490_quality_control    IRRAD555      
##  Min.   :  0.0     Min.   :1.0              Min.   :  0.0    
##  1st Qu.:  0.0     1st Qu.:1.0              1st Qu.:  0.0    
##  Median :  0.0     Median :1.0              Median :  0.0    
##  Mean   :  7.7     Mean   :2.4              Mean   :  4.5    
##  3rd Qu.:  4.8     3rd Qu.:4.0              3rd Qu.:  1.1    
##  Max.   :334.4     Max.   :4.0              Max.   :331.1    
##  NA's   :1514906   NA's   :1514906          NA's   :1514906  
##  IRRAD555_quality_control    IRRAD670       IRRAD670_quality_control
##  Min.   :1.0              Min.   :  0.0     Min.   :1.0             
##  1st Qu.:1.0              1st Qu.:  0.0     1st Qu.:1.0             
##  Median :1.0              Median :  0.0     Median :1.0             
##  Mean   :2.4              Mean   :  1.5     Mean   :2.4             
##  3rd Qu.:4.0              3rd Qu.:  0.0     3rd Qu.:4.0             
##  Max.   :4.0              Max.   :326.0     Max.   :4.0             
##  NA's   :1514906          NA's   :1514906   NA's   :1514906         
##      geom          
##  Length:3101322    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##